arXiv:1509.06088v1 [stat.ML] 21 Sep 2015


Significance Analysis of High-Dimensional, 
Low-Sample Size Partially Labeled Data 

Qiyi Lu and Xingye Qiao

Department of Mathematical Sciences 
Binghamton University, State University of New York 
Binghamton, New York, 13902-6000 

E-mails: {qlu,qiao}@math.binghamton.edu
Phone: (607) 777-2147 
Fax: (607) 777-2450 

September 22, 2015 


Correspondence to: Qiyi Lu (e-mail: qlu@math.binghamton.edu). Qiyi Lu is a doctoral candidate and
Xingye Qiao (e-mail: qiao@math.binghamton.edu) is an Assistant Professor in the Department of Mathematical
Sciences at Binghamton University, State University of New York, Binghamton, New York, 13902-6000.
Qiao’s research is partially supported by a collaboration grant from Simons Foundation (award number 
246649). 





Abstract 


Classification and clustering are both important topics in statistical learning. A
natural question herein is whether predefined classes are really different from one
another, or whether clusters are really there. Specifically, we may be interested in knowing
whether the two classes defined by some class labels (when they are provided), or the
two clusters tagged by a clustering algorithm (when class labels are not provided), are
from the same underlying distribution. Although both are challenging questions for
high-dimensional, low-sample size data, there have been recent developments
for both. However, when it is costly to manually place labels on observations,
often only a small portion of the class labels is available. In this article, we
propose a significance analysis approach for this type of data, namely partially labeled
data. Our method makes use of the whole data set and tests the class difference as
if all the labels were observed. Compared to a testing method that ignores the label
information, our method provides greater power while maintaining the size, as
illustrated by a comprehensive simulation study. Theoretical properties of the proposed
method are studied with emphasis on the high-dimensional, low-sample size setting.
Our simulated examples help to understand when and how the information extracted
from the labeled data can be effective. A real data example further illustrates the
usefulness of the proposed method.


KEY WORDS: Classification; Clustering; High-dimensional, low-sample size data;
Hypothesis test; Semi-supervised learning.





1 Introduction 


Classification and clustering are both important tools in statistical learning. The availability
of the class labels distinguishes these two main domains. In classification, class labels are
provided prior to the analysis, while they are unavailable in clustering analysis. A
natural statistical question regarding their use is whether classes/clusters are really there.
In a setting where binary class labels are observed, we may be interested in testing whether
the two classes are from the same distribution. Though often neglected, this is an important
step before applying a classification algorithm. Standard statistical textbooks cover
many significance tests, such as the two-sample t-test, one-way ANOVA, Hotelling's $T^2$ test,
and MANOVA. Among these, the two-sample t-test and ANOVA are univariate tests.
Hotelling's $T^2$ test and MANOVA are multivariate tests, though both can fail when the
dimension $d$ is much greater than the sample size $n$.

This problem of testing the difference between two classes becomes even more challenging
for high-dimensional, low-sample size (HDLSS) data. Hotelling's $T^2$ test is very
powerful when the dimension is smaller than the sample size. It is invariant under linear
transformation. In addition, under the null hypothesis, the distribution of the statistic is
known. However, Hotelling's $T^2$ statistic cannot be computed in the HDLSS setting
because the sample covariance matrix is not invertible. There have been efforts to overcome
this issue, including Dempster (1960), Bai and Saranadasa (1996), Srivastava and Du
(2008) and Chen and Qin (2010). These methods use diagonalized versions of the covariance
or inverse covariance matrices in Hotelling's $T^2$ statistic. There are many other treatments,
such as Srivastava and Fujikoshi (2006), Schott (2007) and Srivastava (2007), which
calibrate the distribution of some proposed statistic. In addition, the
Direction-Projection-Permutation (DiProPerm) test (Wichers et al., 2007; Wei et al., 2015) has proved to
be very effective for testing the class difference of HDLSS data.

Besides the difficulty brought by the high dimensionality, in many real problems
many observations in a data set are left without class labels (the
unlabeled data portion). One reason is that it is often difficult or expensive to
obtain the class label information, while it may be relatively cheap to obtain the covariate
information even for many observations. In such a situation, the aforementioned testing
methods, which require label information, cannot be applied to the whole data set to test the
class difference. As a consequence, one may have to forfeit the potentially useful information
that resides in the unlabeled data. For instance, many cancer patients are categorized
into certain cancer subtypes by radiologists through an inspection of the medical images.
However, because of the high health care cost, medical images are easier to obtain than the
actual diagnosis. Before a classification algorithm is used to design a data-mining-based
early-screening machine (see, for example, Land et al., 2012; Schaffer et al., 2012), a valuable
question to ask is whether the so-called subtypes, many of which may be ad hoc or based on
experience, are really there.

One possible, but clearly flawed, solution to this problem is to treat all the data as
unlabeled. In the unsupervised context, where no class labels are provided
for the analysis, clustering algorithms have been broadly applied in many fields. To
determine whether clusters are really there, several methods have been developed to assess
the significance of clusters, including McShane et al. (2002), Tibshirani and Walther (2005),
Suzuki and Shimodaira (2006) and Liu et al. (2008). However, these methods are not directly
applicable to partially labeled data, unless one forfeits the potentially useful information
that resides in the class labels available in the labeled data portion of the full data
set.

Hence, there seems to be a dilemma in testing partially labeled data: to ignore the
unlabeled data completely (and apply a significance test to the labeled data only), or to
ignore the class labels in the labeled data portion (and use a significance test for
clustering). Although each has its own applicability domain, neither looks promising for our
purpose. This motivates us to devise a significance testing method for HDLSS partially labeled data.
When class labels are partially provided, the unlabeled data are used to better estimate the
sampling distribution. In the meantime, the class labels help to effectively distinguish the
two classes even if their difference is small. Our proposed method is named Significance
Analysis of HDLSS Partially Labeled Data (SigPal).


[Figure 1 panels: "Single Gaussian" (top-left); "DiProPerm: pval=0" (top-right); "SigClust: pval=0.296" (bottom-left); "SigPal: pval=0.018" (bottom-right).]

Figure 1: The DiProPerm test is applicable when all class labels are known (top-right panel)
while SigClust does not require any label information (bottom-left panel). The DiProPerm
test correctly concludes that the two classes are indeed from two distributions (with
p-value = 0), whereas the SigClust method fails to find this important difference (p-value =
0.296). When the majority of the data are unlabeled with a small portion of labeled data,
our proposed SigPal approach can give a significant conclusion with p-value = 0.018, which
is close to the DiProPerm result.

To illustrate our main idea, we show a toy example under different settings in Figure 1.
The data in the top-left panel are generated from a Gaussian distribution and the data in the
other three panels come from a mixture of two Gaussian distributions with a small difference
in the mean, $0.5N(-\mu, I_2) + 0.5N(\mu, I_2)$, where $\mu = (0.5, 0)'$. To ease the presentation,
the toy example is two-dimensional, though the message applies to HDLSS data.
We show two significance analysis methods for HDLSS data that inspire our approach,
the DiProPerm test of Wichers et al. (2007) and Wei et al. (2015), and the Statistical
Significance of Clustering method (SigClust) of Liu et al. (2008) and Huang et al. (2014).
The DiProPerm test is applicable when all the class labels are known (see the different
colors/marker types in the top-right panel, where each component of the mixture distribution
corresponds to one class). In contrast, SigClust does not require any label information
(bottom-left panel). The (empirical) p-value of the DiProPerm test turns out to be 0, which
leads to the correct conclusion that the two classes are indeed from two distributions, whereas
the SigClust method fails to find this important difference (p-value = 0.296). Our proposed
SigPal approach is designed for the case shown in the bottom-right panel. Given some
labeled data, SigPal can give a significant conclusion with p-value = 0.018, which is close to
the DiProPerm result. All three methods will be introduced or revisited in the next
two sections.

The rest of the article is organized as follows. In Section 2, we review the DiProPerm test 
and the SigClust method. Section 3 presents our proposed SigPal method. Some theoretical 
results are studied in Section 4 which emphasize the HDLSS setting. A comprehensive 
simulation study and real data case study are provided in Section 5. Section 6 gives some 
concluding remarks. The appendix is devoted to technical proofs. 

2 DiProPerm Test and SigClust Test 

In this section, we review two significance analysis methods, DiProPerm and SigClust. Both
methods are specifically designed for testing HDLSS data, although they may be applied to
low-dimensional data as well. DiProPerm is used when all the class labels are fully observed
while SigClust is applicable when the data set has no class label information. The hypotheses
of the two methods differ slightly.




2.1 DiProPerm Test 


In practice, permutation tests are often used for the purpose of testing the class difference,
where the null distribution is mimicked by the empirical distribution of the statistic calculated
from many randomly permuted data sets. However, for high-dimensional data, some
distance measures with direct permutation may not work. This is because when $d \gg n$, such
a distance measure will be mainly driven by the error aggregated over dimensions, rather than
the true mean difference between classes. To address this issue, a three-step procedure called
the Direction-PROjection-PERMutation test (DiProPerm) was proposed in Wichers et al. (2007)
and further studied in Wei et al. (2015) for the two-class setting. DiProPerm was designed
for data with fully observed labels. It tests the null hypothesis of equality of distributions:
$H_0$: the distributions of the two classes are the same, and
$H_1$: the distributions of the two classes are not the same.

Another item of interest is to test the weaker null hypothesis of equality of means:

$H_0$: the distributions of the two classes have equal means, and
$H_1$: the distributions of the two classes have different means.

Algorithm 1. The DiProPerm test comprises three steps.

1. Direction: a direction which is capable of separating the two classes is found, such as
the classification direction from Support Vector Machine (SVM; Vapnik, 1995; Cortes
and Vapnik, 1995), Distance Weighted Discrimination (DWD; Marron et al., 2007), or
their hybrids (Qiao and Zhang, 2015a,b).

2. Projection: all the data vectors are projected onto this direction so that a univariate
statistic (such as the two-sample t-statistic or the mean difference) can be calculated.

3. Permutation: all the data are randomly relabeled and the first two steps are repeated
$N_{perm}$ times ($N_{perm}$ may be 1000). An empirical p-value is calculated to assess the
statistical significance (the proportion of the statistics from the permutation set that
are greater than that from the data).
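
The three steps can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the direction is taken to be the normalized difference of the class means (an assumption standing in for SVM or DWD), the statistic is the projected mean difference, and the function names `projected_mean_difference` and `diproperm` are hypothetical.

```python
import math
import random


def projected_mean_difference(X, y):
    """Steps 1-2: find a direction, project onto it, return a univariate statistic."""
    d = len(X[0])
    pos = [x for x, lab in zip(X, y) if lab == 1]
    neg = [x for x, lab in zip(X, y) if lab == -1]
    # Direction: normalized difference of class means (assumed stand-in for SVM/DWD).
    diff = [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
            for j in range(d)]
    norm = math.sqrt(sum(v * v for v in diff))
    if norm == 0.0:
        return 0.0
    w = [v / norm for v in diff]

    def score(x):
        # Projection of one observation onto the unit direction w.
        return sum(wj * xj for wj, xj in zip(w, x))

    return (sum(score(x) for x in pos) / len(pos)
            - sum(score(x) for x in neg) / len(neg))


def diproperm(X, y, n_perm=1000, seed=0):
    """Step 3: permute the labels; return (observed statistic, empirical p-value)."""
    rng = random.Random(seed)
    observed = projected_mean_difference(X, y)
    labels = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # random relabeling that keeps the class sizes
        if projected_mean_difference(X, labels) > observed:
            exceed += 1
    return observed, exceed / n_perm


# Toy usage: two classes of 10 points in 50 dimensions with a large mean shift.
rng = random.Random(1)
X = [[rng.gauss(0.0, 1.0) + (5.0 if i < 10 else -5.0) for _ in range(50)]
     for i in range(20)]
y = [1] * 10 + [-1] * 10
stat, pval = diproperm(X, y, n_perm=200, seed=2)
print(stat, pval)
```

Recomputing the direction inside every permutation matters: a direction fitted once on the original labels would favor the observed statistic and make the test anti-conservative.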




In Figure 2, we illustrate how the p-value in DiProPerm is calculated, using the same data 
as shown in Figure 1. In the left plot, we perform a DiProPerm test with 1000 permutations. 
The test statistics calculated for the permutations are shown as the blue jitter points, while 
that for the original data is the green vertical line. Here the mean difference is chosen as 
the statistic. The greater the statistic is, the more significantly different the two classes
are. Hence, the empirical p-value is calculated as the proportion of the statistics from the 
permutation set that are greater than that from the data, which is 0 in this case. 

[Figure 2 panels, left to right: "DiProPerm: pval=0" (N = 1000, Bandwidth = 0.143); "SigClust: pval=0.296" (N = 1000, Bandwidth = 0.006368); "SigPal: pval=0.018" (N = 500, Bandwidth = 0.007718).]


Figure 2: Illustration of the calculation of the p-values for DiProPerm (left), SigClust (middle)
and SigPal (right). The test statistics for the permutation/simulation set are shown as
blue jitter points, while those for the original data are shown as the green vertical lines. The
empirical p-value for DiProPerm is the proportion of the statistics from the permutation set
that are greater than that from the data, which is 0 in this case. The empirical p-values for
both SigClust and SigPal are the proportions of the statistics from the simulation set that
are less than that from the data, which are 0.296 and 0.018 respectively.


The main idea of DiProPerm is to measure the difference between two high-dimensional
data subsets by the difference between their one-dimensional projections onto some appropriate
direction. DiProPerm is a powerful test in many settings and it is a nonparametric procedure
that makes few assumptions. Note that here class labels are required to find the
projection direction (by a classification method such as SVM or DWD) and to calculate the
test statistic (t-statistic or mean difference).








2.2 SigClust 


SigClust, proposed by Liu et al. (2008) and improved by Huang et al. (2014), is a clustering
evaluation tool for HDLSS data which aims to answer the question of whether clusters are
really there. That is, it has the following hypotheses:

$H_0$: the data are from a single Gaussian distribution, and

$H_1$: the data are not from a single Gaussian distribution.

There is no reference to the notion of class in the hypotheses above. SigClust is based on
the vision of a cluster as a subset of the data that can be reasonably modeled as coming from
a single Gaussian distribution (with some covariance matrix). The Gaussian assumption
has been previously used by Sarle and Kuo (1993) and McLachlan and Peel (2004). Huang
et al. (2014) mentioned that this assumption may lead to some important consequences. For
example, it is possible that none of Cauchy, Uniform, or even t distributed data may be
viewed as a single cluster in this sense. While it may seem to be a strong assumption, it is
reasonable in the challenging HDLSS situation because it allows real HDLSS data analysis
with wide use in bioinformatics applications (Chandriani et al., 2009; Verhaak et al., 2010).

Assume that a data set $\{x_i, i = 1, \ldots, n\}$ is obtained from an unknown Gaussian distribution
with covariance $\Sigma$, where $x_i \in \mathbb{R}^d$ is the observed covariate vector. The idea of SigClust is
to approximate the null distribution of a test statistic by simulating from a single Gaussian
distribution that is fitted to the data. The p-value in SigClust is taken to be the lower quantile
of the null distribution, defined by the test statistic from the original data. It is similar to
the DiProPerm test except that it performs simulation instead of permutation and it relies
on a multivariate statistic instead of a univariate statistic after projection.

Specifically, the 2-means Cluster Index (CI) is used as the test statistic. It is defined as
the sum of the within-cluster sums of squares about the cluster means, divided by the sum
of squares about the overall mean,

$$\mathrm{CI} = \frac{\sum_{k=1}^{2} \sum_{j \in C_k} \|x_j - \bar{x}^{(k)}\|^2}{\sum_{j=1}^{n} \|x_j - \bar{x}\|^2},$$

where $C_k$ denotes the sample index set of the $k$th cluster, $\bar{x}^{(k)}$ represents the mean of
the $k$th cluster, for $k = 1, 2$, and $\bar{x}$ is the overall mean. The smaller the CI, the larger the
proportion of the overall variation that is explained by the clusters. Note that no predefined
class labels are needed when computing the CI, as the cluster assignment $C_k$ is obtained
concurrently by a clustering algorithm.
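
The definition translates directly into a few lines of Python. The sketch below computes the CI for a given assignment; the function name `cluster_index` and the four-point data set are illustrative assumptions, and the cluster assignment is supplied by hand rather than by a clustering algorithm.

```python
def cluster_index(X, labels):
    """Within-cluster sum of squares divided by the total sum of squares."""
    n, d = len(X), len(X[0])
    overall = [sum(x[j] for x in X) / n for j in range(d)]
    total_ss = sum((x[j] - overall[j]) ** 2 for x in X for j in range(d))
    within_ss = 0.0
    for k in set(labels):
        members = [x for x, lab in zip(X, labels) if lab == k]
        center = [sum(x[j] for x in members) / len(members) for j in range(d)]
        within_ss += sum((x[j] - center[j]) ** 2 for x in members for j in range(d))
    return within_ss / total_ss


# Two tight, well-separated groups: the matching split gives a CI near 0 ...
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
ci_good = cluster_index(X, [1, 1, 2, 2])
# ... while a split that ignores the structure leaves the CI near 1.
ci_bad = cluster_index(X, [1, 2, 1, 2])
print(ci_good, ci_bad)
```

Here the total sum of squares is 101, so the two assignments give 1/101 and 100/101 respectively.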

Here the simulation from the null distribution is a critical part. As the CI is location and
rotation invariant, it is enough to work only with a Gaussian null distribution with mean
at the origin and a diagonal covariance matrix $\Lambda$. Hence, an essential part of the SigClust
test is the estimation of the eigenvalues of the covariance matrix $\Sigma$.

Algorithm 2. The SigClust procedure is summarized as follows.

1. Initialization: obtain a two-cluster assignment ($k = 2$) from an application of a
clustering algorithm, such as $k$-means. The CI is then calculated for the original data
set based on the cluster assignment.

2. Simulation: simulate data from the null distribution: $(x_1, \ldots, x_d)$ are independent
with $x_j \sim N(0, \hat\lambda_j)$, where $(\hat\lambda_1, \ldots, \hat\lambda_d)$ is an estimate of the eigenvalues
$(\lambda_1, \ldots, \lambda_d)$ of the covariance matrix $\Sigma$. Then calculate the corresponding CI on each
simulated data set after performing clustering in the same manner as in the Initialization step.

3. Testing: repeat the Simulation step $N_{sim}$ times to obtain an empirical distribution
of the CI under the null hypothesis ($N_{sim}$ may be 1000). Then calculate the
empirical p-value to assess the statistical significance (the proportion of the CIs from
the simulation set that are less than the CI from the original data).
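
A compact sketch of this procedure in plain Python follows. It is an illustration under stated assumptions, not the published implementation: the eigenvalue estimates `lam_hat` are passed in by the caller instead of being estimated from the data, clustering is done by a basic Lloyd iteration with a few random starts, and all function names are hypothetical.

```python
import math
import random


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def cluster_index(X, labels):
    overall = mean_vec(X)
    total = sum(sq_dist(x, overall) for x in X)
    within = 0.0
    for k in set(labels):
        members = [x for x, lab in zip(X, labels) if lab == k]
        center = mean_vec(members)
        within += sum(sq_dist(x, center) for x in members)
    return within / total


def two_means_ci(X, rng, n_start=3, n_iter=20):
    """Smallest CI found by Lloyd's algorithm (k = 2) over a few random starts."""
    best = 1.0
    for _start in range(n_start):
        centers = rng.sample(X, 2)
        labels = None
        for _it in range(n_iter):
            new = [0 if sq_dist(x, centers[0]) <= sq_dist(x, centers[1]) else 1
                   for x in X]
            if new == labels or len(set(new)) < 2:
                break  # converged, or a degenerate one-cluster split
            labels = new
            centers = [mean_vec([x for x, lab in zip(X, labels) if lab == k])
                       for k in (0, 1)]
        if labels is not None:
            best = min(best, cluster_index(X, labels))
    return best


def sigclust_pvalue(X, lam_hat, n_sim=1000, seed=0):
    """Return (observed CI, proportion of simulated CIs below the observed CI)."""
    rng = random.Random(seed)
    n = len(X)
    observed = two_means_ci(X, rng)                 # Initialization
    sds = [math.sqrt(lam) for lam in lam_hat]
    below = 0
    for _ in range(n_sim):                          # Simulation and Testing
        sim = [[rng.gauss(0.0, s) for s in sds] for _ in range(n)]
        if two_means_ci(sim, rng) < observed:
            below += 1
    return observed, below / n_sim
```

A call might look like `sigclust_pvalue(X, lam_hat, n_sim=1000)`, where `X` is a list of $n$ observations of length $d$ and `lam_hat` holds $d$ eigenvalue estimates.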

The middle plot in Figure 2 illustrates how the p-value in SigClust is calculated. Similar
to the DiProPerm case (left plot), the blue jitter points are the statistics from the simulations,
and the green vertical line is that for the original data. Recall that the smaller the statistic
(chosen as the CI) is, the more significantly different the two clusters are. Hence the empirical
p-value is the proportion of the CIs from the simulation set that are less than that from the
original data, which is 0.296 in this case.


The covariance estimation in the Simulation step can be challenging, especially when
the data are HDLSS. Although we only need to estimate the eigenvalues of the covariance
matrix, which greatly reduces the number of parameters to be estimated, this problem is
still not trivial in the HDLSS setting. Liu et al. (2008) used a hard-thresholding approach
for eigenvalue estimation. In particular, they first estimate the background noise level using
a robust variance estimate. Then those estimated eigenvalues smaller than the background
noise level are replaced with the noise level, that is,

$$\hat\lambda_j = \begin{cases} \tilde\lambda_j, & \text{if } \tilde\lambda_j \geq \sigma_N^2, \\ \sigma_N^2, & \text{if } \tilde\lambda_j < \sigma_N^2, \end{cases}$$

where $(\tilde\lambda_1, \ldots, \tilde\lambda_d)$ are the eigenvalues of the sample covariance matrix and $\sigma_N^2$ is the
estimated background noise level.

Huang et al. (2014) later showed that with eigenvalue estimation using hard-thresholding,
SigClust can be seriously anti-conservative if the first eigenvalue is relatively large. They
proposed a less aggressive soft-thresholding variant which greatly improved the performance
of SigClust. Specifically, with $\tilde\lambda_j$ the sample eigenvalues and $\sigma_N^2$ the background noise
level, they use

$$\hat\lambda_j = (\tilde\lambda_j - \tau - \sigma_N^2)_+ + \sigma_N^2.$$

A detailed definition of $\tau$ can be found in Huang et al. (2014).
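
The two thresholding rules can be compared numerically with a short sketch. It assumes the sample eigenvalues and the noise level are already available, writes the soft rule as $\max(\tilde\lambda_j - \tau, \sigma_N^2)$, and picks $\tau$ by bisection so that the sum of the eigenvalues is preserved. That choice of $\tau$ is an assumption made for illustration; the exact definition is the one in Huang et al. (2014). The function names are hypothetical.

```python
def hard_threshold(sample_eigs, noise_var):
    """Replace every eigenvalue below the noise level by the noise level."""
    return [max(lam, noise_var) for lam in sample_eigs]


def soft_threshold(sample_eigs, noise_var, tol=1e-10):
    """Shift eigenvalues down by tau, floor them at noise_var, keep the total sum.

    tau is found by bisection (an illustrative assumption, see the lead-in).
    """
    target = sum(sample_eigs)
    d = len(sample_eigs)
    if target <= d * noise_var:
        # No shift tau >= 0 can preserve the sum; fall back to a flat spectrum
        # (assumed fallback, not taken from the source).
        return [noise_var] * d
    lo, hi = 0.0, max(sample_eigs)
    while hi - lo > tol:
        tau = (lo + hi) / 2.0
        total = sum(max(lam - tau, noise_var) for lam in sample_eigs)
        if total > target:
            lo = tau   # still too much energy: shift further
        else:
            hi = tau
    tau = (lo + hi) / 2.0
    return [max(lam - tau, noise_var) for lam in sample_eigs]


eigs = [10.0, 4.0, 1.0, 0.5, 0.1]
hard = hard_threshold(eigs, 1.0)
soft = soft_threshold(eigs, 1.0)
print(sum(eigs), sum(hard), sum(soft))
```

In this example the hard rule inflates the total from 15.6 to 17, whereas the soft rule keeps the total at 15.6 by lowering the two leading eigenvalues.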

3 Significance Analysis for Partially Labeled Data 

In this section, we first state the background and hypotheses of our problem, followed by a
presentation of our proposed method.

3.1 Background and Hypotheses 

Consider a binary testing problem for a data set with the labeled data portion
$\{(x_i, y_i), i = 1, \ldots, n_l\}$, and the unlabeled data portion $\{x_{n_l+j}, j = 1, \ldots, n_u\}$.
All the $x_i$'s and $x_{n_l+j}$'s are $d$-dimensional covariates and the class label $y_i \in \{-1, 1\}$.
The total sample size is $n = n_l + n_u$. Let $\theta = n_l/n$ be the proportion of the labeled data,
so that $1 - \theta$ is the proportion of the unlabeled data. We formulate our proposed SigPal
procedure as a hypothesis testing problem with the following hypotheses:

$H_0$: the data come from a single Gaussian distribution, and

$H_1$: the conditional distributions of the two classes are different and hence the data are
not from a single Gaussian distribution.

It is worth comparing our alternative hypothesis with those of the DiProPerm and SigClust
tests. Since not every class label is observed, the notion of the class, as in the alternative
hypothesis of DiProPerm, is moot or murky. Technically, our alternative hypothesis is neither
an intersection nor a union of the previous alternative hypotheses. In the framework
of SigPal, there exists an underlying class label for each observation. We are interested in
the difference in the conditional distributions of the data with respect to this underlying
label. Our goal is to infer the significance of the otherwise fully observed data based on the
partially labeled data with the help of the covariate information of the whole data. Lastly,
the fact that the conditional distributions are different implies that the data are not from a
single Gaussian distribution (but the converse is not true).

3.2 Proposed Method 

With different values of the proportion of the labeled data, $\theta$, we may consider different ways
to address the problem. When $\theta$ is close to 1, which means the majority of the data have
label information available, one may just ignore the small amount of unlabeled data
and perform a DiProPerm test on the labeled data portion only. When $\theta$ is very close to
0, which means the majority of the data are unlabeled, one can simply apply SigClust
regardless of the few class labels. While we may lose some useful information from the data,
such simplifications effectively reduce the complexity of the problem. In this article, we
are more interested in the case when $\theta$ is not close to 0 or 1. We propose a Significance
Analysis for Partially Labeled Data (SigPal) which makes use of the extra label information,
compared to the SigClust procedure.


Algorithm 3. The SigPal procedure consists of the following three steps.

1. Initialization: obtain the predicted class assignments for the unlabeled data by applying
a semi-supervised classification/clustering method to the full data set. A test
statistic is then calculated for the original data based on both observed and predicted
class labels (our choice is the CI in this article).

2. Simulation: simulate data from the null distribution: $(x_1, \ldots, x_d)$ are independent
with $x_j \sim N(0, \hat\lambda_j)$, where $(\hat\lambda_1, \ldots, \hat\lambda_d)$ is an estimate of the eigenvalues
$(\lambda_1, \ldots, \lambda_d)$ of the covariance matrix $\Sigma$. Randomly place class labels on $n_l$ observations
in the simulated set and then calculate the corresponding test statistic after performing
semi-supervised classification/clustering in the same manner as in the Initialization step.

3. Testing: repeat the Simulation step $N_{sim}$ times to obtain an empirical distribution
of the test statistic under the null hypothesis. Calculate the empirical p-value
(the proportion of the CIs from the simulated data that are less than that from the
original data) to assess the statistical significance.
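
The following plain-Python sketch walks through the three steps. Several choices are assumptions made for illustration only: a "seeded" 2-means, in which labeled points keep their labels and unlabeled points move to the nearest class center, stands in for the semi-supervised method; the eigenvalue estimates `lam_hat` are supplied by the caller; and the simulated labels reuse the observed label counts at random positions. All function names are hypothetical.

```python
import math
import random


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def cluster_index(X, labels):
    overall = mean_vec(X)
    total = sum(sq_dist(x, overall) for x in X)
    within = 0.0
    for k in set(labels):
        members = [x for x, lab in zip(X, labels) if lab == k]
        center = mean_vec(members)
        within += sum(sq_dist(x, center) for x in members)
    return within / total


def seeded_two_means(X, y, n_iter=20):
    """Assign each point to +1 or -1; entries of y equal to None are unlabeled.

    Requires at least one labeled point in each class.
    """
    labels = list(y)
    centers = {k: mean_vec([x for x, lab in zip(X, y) if lab == k])
               for k in (1, -1)}
    for _ in range(n_iter):
        new = [lab if lab is not None
               else (1 if sq_dist(x, centers[1]) <= sq_dist(x, centers[-1]) else -1)
               for x, lab in zip(X, y)]
        if new == labels:
            break
        labels = new
        centers = {k: mean_vec([x for x, lab in zip(X, labels) if lab == k])
                   for k in (1, -1)}
    return labels


def sigpal_pvalue(X, y, lam_hat, n_sim=500, seed=0):
    """Return (observed CI, proportion of simulated CIs below the observed CI)."""
    rng = random.Random(seed)
    n = len(X)
    observed = cluster_index(X, seeded_two_means(X, y))      # Initialization
    sds = [math.sqrt(lam) for lam in lam_hat]
    below = 0
    for _ in range(n_sim):                                   # Simulation, Testing
        sim = [[rng.gauss(0.0, s) for s in sds] for _ in range(n)]
        y_sim = list(y)
        rng.shuffle(y_sim)  # labels land on random observations, counts unchanged
        if cluster_index(sim, seeded_two_means(sim, y_sim)) < observed:
            below += 1
    return observed, below / n_sim
```

The only structural difference from a SigClust-style simulation is the random labeling of the simulated data, which lets the null statistics reflect the constraint that labeled points cannot be reassigned.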

The right plot in Figure 2 shows the p-value calculation for SigPal. Similar to SigClust,
the empirical p-value is the proportion of the CIs from the simulation set that are less than
that from the original data, which is 0.018 in this case.

To calculate the statistic in the Initialization step, we need to assign labels to the
unlabeled portion. This is similar to the application of a clustering algorithm in SigClust. Such
label assignment can be done either by modifying a classification method or by modifying a
clustering algorithm.

• While we could simply use a classifier trained on the labeled portion to predict the
class labels for the unlabeled portion, a semi-supervised classification method is more
reasonable here since it takes the covariate information in the large number of unlabeled
observations into account. Possible choices of the semi-supervised classification method
include Semi-Supervised Sparse Linear Discriminant Analysis (S$^2$LDA; Lu and Qiao,
2015), transductive SVM (TSVM; Vapnik, 1998; Chapelle et al., 2006; Wang et al.,
2007), the large-margin based methods (Wang and Shen, 2007; Wang et al., 2009) and
the bootstrap method (Collins and Singer, 1999). S$^2$LDA combines the classical linear
discriminant analysis and a machine learning oriented technique, and takes advantage
of the unlabeled data to boost the classification performance.

• Similarly, though we could just run a clustering algorithm on the whole data set
to assign labels, we would like to borrow the strength of the labeled data portion. A
semi-supervised clustering algorithm, which identifies clusters with constraints imposed
by known labels, would be more appropriate in this case. Possible semi-supervised
clustering algorithms include constrained $k$-means (COP-KMEANS; Wagstaff et al.,
2001), and others. COP-KMEANS allows a must-link constraint which specifies that
certain instances have to be placed in the same cluster.

Once the class/cluster labels are assigned, a test statistic can be calculated. Options
include Hotelling's $T^2$ statistic, the CI, and some one-dimensional statistics (such as the
two-sample t-statistic or mean difference) after projections as in DiProPerm. The CI is
favored here since it is location and rotation invariant and can be efficiently computed. It
also facilitates the comparison between SigPal and SigClust in our numerical studies. It can
be shown that for certain low-dimensional examples, the CI is equivalent to the two-sample
t-statistic.

Similar to SigClust, we use simulation in lieu of permutation, and make use of the
soft-thresholding method (Huang et al., 2014) for eigenvalue estimation. SigPal randomly labels
some observations in the simulated data. This extra step is essential to mimic the true null
distribution of the test statistic.

As will be shown later, although SigClust has some power when the signal within the
data is relatively large, it is substantially less powerful when the signal is weak. SigPal,
on the other hand, has great power in both cases. Secondly, when the data come from a
mixture of two Gaussian distributions and the mean difference is large enough, the labeled
data may not provide an additional boost in power compared to SigClust. In this case, it
may make sense to simply apply SigClust and ignore the label information. As will be seen
in the later sections, the strength of SigPal lies in the usefulness of the labeled data: it is
visibly more powerful than SigClust when the signal is small.

4 Theoretical Property 

In this section, we provide some theoretical justification for the SigPal method. We first
derive the relationship between the theoretical version of the CI (TCI) and the eigenvalues
of the covariance. Specifically, we assume that $X \sim N(0, \Sigma)$ and consider using S$^2$LDA for
class assignment. The theoretical S$^2$LDA (Lu and Qiao, 2015) coefficient $\omega$ is defined as

$$\omega = \operatorname*{argmin}_{\|\omega\|=1,\, b=0} E_{(X,Y)}\big(Y - (\omega'X + b)\big)^2 + C\, E_X\big(1 - |\omega'X + b|\big)_+
= \operatorname*{argmin}_{\|\omega\|=1} E_{(X,Y)}\big(Y - \omega'X\big)^2 + C\, E_X\big(1 - |\omega'X|\big)_+,$$

where $C > 0$ is a constant. We consider the case where the effect of the unlabeled data
portion dominates, that is, we let $C \to \infty$. The relationship between the theoretical cluster
index (TCI) for SigPal and the eigenvalues of $\Sigma$ is stated in Theorem 1.

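Theorem 1. Suppose that $X \sim N(0, \Sigma)$, $P(Y = +1) = P(Y = -1) = 1/2$ and the proportion
of the labeled data is $\theta$. Assume that $\Sigma$ has an eigen-decomposition $\Sigma = V'\Lambda V$, where
$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ with $\lambda_1 > \lambda_2 > \cdots > \lambda_d$. Let $v_1$ be the first principal component
direction. Then when $C \to \infty$, $\omega = v_1$, and the corresponding TCI is

$$\mathrm{TCI} = 1 + \theta - \frac{2}{\pi}(1 - \theta)^3 \frac{\lambda_1}{\sum_{i=1}^{d} \lambda_i}.$$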

Theorem 1 shows that given $\theta$, the optimal TCI only relies on the largest eigenvalue $\lambda_1$
and the sum of eigenvalues $\sum_{i=1}^{d} \lambda_i$. In practice, the estimations of these two quantities have
a critical impact on the p-values, defined as the proportion of the CIs from the simulated
data that are less than that from the original data. In particular, let $\hat\lambda_i$ denote the estimate of
$\lambda_i$. Assume that $\theta = 1/2$ (for a mere illustration), then the difference between the true TCI
(the one for the original distribution) and the TCI resulting from a Gaussian distribution
with covariance $\hat\Lambda$ (the estimated $\Lambda$) is proportional to

$$E = \frac{\hat\lambda_1}{\sum_{i=1}^{d} \hat\lambda_i} - \frac{\lambda_1}{\sum_{i=1}^{d} \lambda_i}.$$


For the hard-thresholding method, define the potential biases in the estimation of $\lambda_1$ and
$\sum_{i=1}^{d} \lambda_i$ as $\delta_1$ and $\Delta$ respectively. Then

$$E = \frac{\lambda_1 + \delta_1}{\sum_{i=1}^{d} \lambda_i + \Delta} - \frac{\lambda_1}{\sum_{i=1}^{d} \lambda_i}
= \frac{\delta_1 \sum_{i=1}^{d} \lambda_i - \lambda_1 \Delta}{\sum_{i=1}^{d} \lambda_i \left(\sum_{i=1}^{d} \lambda_i + \Delta\right)}. \tag{1}$$

The hard-thresholding method will tend to be anti-conservative when $E < 0$, or
$\lambda_1 \Delta > \delta_1 \sum_{i=1}^{d} \lambda_i$, that is, when the first eigenvalue is large relative to the rest (assuming that $\Delta$ is
positive, which is very likely for the hard-thresholding method since the smallest eigenvalues
are replaced by the background noise level). On the other hand, the soft-thresholding method
is energy preserving in the sense that the sum of the soft eigenvalues is the sum of the sample
eigenvalues, and thus $\Delta = 0$. It follows that when $\hat\lambda_1 < \lambda_1$, that is, when the largest
eigenvalue is under-estimated, the soft-thresholding method will be anti-conservative. This
happens when the first eigenvalue is only a little larger than the background noise. A detailed
discussion about the results of hard-thresholding and soft-thresholding methods can be found
in Huang et al. (2014).

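We further explore the impact of the estimation of $\lambda_i$ on the TCI in SigPal and SigClust.
Let $\mathrm{TCI}_{SigPal}$ and $\mathrm{TCI}_{SigClust}$ denote the TCIs of SigPal and SigClust respectively,
and $\widehat{\mathrm{TCI}}_{SigPal}$ and $\widehat{\mathrm{TCI}}_{SigClust}$ the TCIs from Gaussian distributions based on the estimated
covariance (that is, the simulated data in both SigPal and SigClust). By Theorem 3 of Huang
et al. (2014),

$$\mathrm{TCI}_{SigClust} = 1 - \frac{2}{\pi} \frac{\lambda_1}{\sum_{i=1}^{d} \lambda_i}.$$

Then when $\theta = 1/2$ (again, for an illustration), the differences are

$$\mathrm{TCI}_{SigPal} - \mathrm{TCI}_{SigClust} = \frac{1}{2} + \frac{7}{4\pi} \frac{\lambda_1}{\sum_{i=1}^{d} \lambda_i}, \tag{2}$$

$$\widehat{\mathrm{TCI}}_{SigPal} - \widehat{\mathrm{TCI}}_{SigClust} = \frac{1}{2} + \frac{7}{4\pi} \frac{\hat\lambda_1}{\sum_{i=1}^{d} \hat\lambda_i}. \tag{3}$$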

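A SigPal/SigClust test gains power if $\mathrm{TCI} < \widehat{\mathrm{TCI}}$, that is, the TCI for the original data
is less than the TCI from simulated data. When $\hat\lambda_1 / \sum_{i=1}^{d} \hat\lambda_i > \lambda_1 / \sum_{i=1}^{d} \lambda_i$,
$\widehat{\mathrm{TCI}}_{SigPal} - \mathrm{TCI}_{SigPal}$ is greater than $\widehat{\mathrm{TCI}}_{SigClust} - \mathrm{TCI}_{SigClust}$ due to (2) and (3).
Therefore, SigPal is more powerful than SigClust when $\hat\lambda_1 / \sum_{i=1}^{d} \hat\lambda_i > \lambda_1 / \sum_{i=1}^{d} \lambda_i$.
Particularly for the soft-thresholding method, $\sum_{i=1}^{d} \hat\lambda_i = \sum_{i=1}^{d} \lambda_i$ because it is energy
preserving. When the first eigenvalue is very large
relative to the others, it is easier to be over-estimated, in which case SigPal will be more
powerful.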

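It is also of interest to look at the change of the difference $\mathrm{TCI}_{\mathrm{SigPal}} - \mathrm{TCI}_{\mathrm{SigClust}}$ with
respect to $\theta$ when the eigenvalues are fixed. Assume that $\lambda_1/\sum_{j=1}^{d}\lambda_j = 1/2$, then we have
the difference

$$\mathrm{TCI}_{\mathrm{SigPal}} - \mathrm{TCI}_{\mathrm{SigClust}} = \frac{1}{\pi}\left[\theta^3 - 3\theta^2 + (\pi + 3)\theta\right].$$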

We plot the difference in Figure 3, which shows that the difference is almost linear in $\theta$. It
also shows that when $\theta = 0$, the difference equals 0 too. This is because when the proportion
of the labeled data is 0, all of the data are unlabeled, in which case the problem
reduces to the SigClust setting. Hence, $\mathrm{TCI}_{\mathrm{SigPal}} = \mathrm{TCI}_{\mathrm{SigClust}}$. As $\theta$ increases, labeled data
play a more and more important role, and the difference between $\mathrm{TCI}_{\mathrm{SigPal}}$ and $\mathrm{TCI}_{\mathrm{SigClust}}$
increases almost linearly with respect to $\theta$.
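
As a quick numerical companion to this discussion, the following minimal sketch evaluates the cubic expression $\frac{1}{\pi}[\theta^3 - 3\theta^2 + (\pi + 3)\theta]$ for the difference when $\lambda_1/\sum_j \lambda_j = 1/2$ and measures how far it departs from the straight line joining its end points. The function names and the grid are illustrative choices, not part of the original analysis.

```python
import math

def tci_difference(theta):
    """TCI_SigPal - TCI_SigClust when lambda_1 / sum_j lambda_j = 1/2."""
    return (theta**3 - 3 * theta**2 + (math.pi + 3) * theta) / math.pi

def chord(theta):
    """Straight line joining the end points theta = 0 and theta = 1."""
    return theta * tci_difference(1.0)

# Evaluate on a grid of labeled proportions (grid size is an arbitrary choice).
grid = [i / 100 for i in range(101)]
max_gap = max(abs(tci_difference(t) - chord(t)) for t in grid)

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"theta = {t:.2f}  difference = {tci_difference(t):.4f}")
print(f"largest departure from the straight line: {max_gap:.4f}")
```

The departure from the chord is small relative to the range of the curve, which is one way to read the statement that the difference is almost linear in $\theta$.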

In the next theorem, we study some asymptotic properties of SigPal. Since our main
interest is in the HDLSS data, we choose to consider asymptotics for $d \to \infty$ with $n$ fixed.
Such HDLSS asymptotic properties were previously studied by Hall et al. (2005),
Ahn et al. (2007), Liu et al. (2008), Jung and Marron (2009), Qiao et al. (2010), Jung et al.
(2012), Qiao and Zhang (2015b), among others.

Consider a mixture of two Gaussian distributions, $\eta N(0, D) + (1 - \eta)N(\mu, D)$, where
$\eta \in (0, 1)$ is the proportion for the mixture, $\mu = (a_1, \ldots, a_d)'$ a constant vector and $D$ is
a diagonal matrix with diagonal elements $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. Note that the theoretical
variance for the $j$th variable of the mixture is $\lambda_j + \eta(1 - \eta)a_j^2$. We assume that $\lambda_j$ and $a_j$'s
are bounded.

Figure 3: The difference $\mathrm{TCI}_{\mathrm{SigPal}} - \mathrm{TCI}_{\mathrm{SigClust}}$ with respect to $\theta$, the proportion of the
labeled data.

Theorem 2. Suppose that the data come from a mixture of two Gaussian distributions,
$\eta N(0, D) + (1 - \eta)N(\mu, D)$, where $\eta \in (0, 1)$, $\mu = (a_1, \ldots, a_d)'$ and $D$ is a diagonal matrix
with diagonal elements $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. Let $n_1$ and $n_2$ be the sample sizes with
$\min\{n_1, n_2\} > 0$ and $n_1 + n_2 = n > 3$. Assume that $\sum_{j=1}^{d}\lambda_j = O(d^{\beta})$ with $0 \le \beta < 1$,
$\sum_{j=1}^{d}a_j^2 = O(d)$, $\sum_{j=1}^{d}\lambda_j a_j^2 = O(d^{\gamma})$ with $\gamma < 2$ and
$\max_j\{\lambda_j + \eta(1 - \eta)a_j^2\} \le M$ with $M > 0$ a fixed constant. If $D$ is known, then for a fixed $n$,
the corresponding SigPal p-value converges to 0 in probability as $d \to \infty$.

Liu et al. (2008) studied a similar result for SigClust in a special case when $a_1 = a_2 =
\cdots = a_d = a$, where $a$ is a fixed constant. While we add some more assumptions, the theorem
shows the asymptotic property for a more general setting. Theorem 2 shows that if the data
come from a mixture of two Gaussian distributions, then under some assumptions, SigPal
tends to reject the null hypothesis when $n$ is fixed and $d \to \infty$. This result justifies the
usefulness of SigPal in the HDLSS data setting.

5 Numerical Study 

In this section we use simulated and real examples to demonstrate the effectiveness of our 
proposed method. 

5.1 Simulations 

We use the same simulation setting as in Liu et al. (2008) and Huang et al. (2014). Three
types of examples are studied, including three cases under both the null and alternative
hypotheses. The sample size for all examples is n = 40 and dimension d = 300. In the first
case, we consider examples of data under the null hypothesis, that is, having one cluster
generated from a single Gaussian distribution. In each example, we check the type-I error
by studying how often SigPal incorrectly rejects the null hypothesis $H_0$. In the second and
the third cases, we consider data from a collection of mixtures of two Gaussian distributions
with different signal sizes and explore the power of our method in terms of how often it
correctly rejects the null hypothesis. For simplicity, we consider a diagonal covariance matrix
D because of the rotation invariance property of CI.

For each simulation, we consider two class assignment methods to be applied in SigPal,
namely, S$^3$LDA and COP-KMEANS. They are applied for both original data and simulated
data before we calculate CI. Under the null hypothesis, theoretically the p-value should
follow the Uniform[0,1] distribution. Then a level $\alpha$ test rejects the null hypothesis with
probability $\alpha$ when D is known. This can be shown by a direct use of the standard probability
integral transformation theorem. To simplify the computation, we fix the tuning parameters
in S$^3$LDA (the goal here is not perfect classification). We also consider an option which uses
$L_1$-LDA for labeled data only on the simulated data while still using S$^3$LDA on the original
data. Note that this does not affect the size of the test but could sacrifice power.
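
The uniformity claim can be illustrated with a small Monte Carlo experiment. The sketch below is not the SigPal procedure itself: it assumes that the p-value is the share of simulated statistics no larger than the observed one, and it replaces the CI by a generic statistic drawn from the same distribution for both the observed and the simulated data, which mimics the case where D is known. The function names, the number of replications and the seed are illustrative assumptions.

```python
import random

def empirical_p_value(observed, simulated):
    """Share of simulated statistics that are no larger than the observed one."""
    return sum(s <= observed for s in simulated) / len(simulated)

def rejection_rate(n_rep=2000, n_sim=200, level=0.05, seed=0):
    """Fraction of replications whose empirical p-value is at most `level`."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        # Under the null, observed and simulated statistics share one distribution.
        observed = rng.gauss(0.0, 1.0)
        simulated = [rng.gauss(0.0, 1.0) for _ in range(n_sim)]
        if empirical_p_value(observed, simulated) <= level:
            rejections += 1
    return rejections / n_rep

rate = rejection_rate()
print(f"share of p-values <= 0.05 under the null: {rate:.3f}")
```

With a finite number of simulated statistics the p-value lives on a grid, so the rejection rate is only approximately equal to the nominal level.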


  v     w   L1-LDA   S3LDA   COP-KMEANS   SigClust
100     1      0       3         4           0
 50     2      1       4         3           0
 20     5      0       1         1           0
 10    10      0       2         2           0
  1     1      0       0         0           0
  3     1      0       0         0           0
  5     1      0       0         0           0
 10     1      0       0         2           0
 20     1      0       2         4           0
 50     1      1       3         4           0
  1     5      0       0         0           0
 10     5      0       1         2           0
 20     5      0       1         2           0
 50     5      0       2         1           0

Table 1: Summary table for the one cluster case over 100 replications based on different
methods under different settings (different pairs of v and w). The numbers of empirical
p-values which are less than 0.05 are reported.

We apply these different versions of SigPal to compare with SigClust in each case. To make
the notation simple, we use $L_1$-LDA to denote SigPal with S$^3$LDA on the original data and
$L_1$-LDA on the simulated data, and use S$^3$LDA and COP-KMEANS to denote SigPal with
the corresponding methods on both original and simulated data.

5.1.1 Case 1: One Cluster 

Suppose that the data are generated from a single multivariate Gaussian distribution with
covariance D, where D is diagonal with diagonal elements $(v, \ldots, v, 1, \ldots, 1)$, that is, there
are $w$ $v$'s and $(d - w)$ 1's. We randomly select 20 observations to be labeled from all the 40
observations and consider 14 combinations of $(v, w)$ as shown in Table 1. Each simulation is
repeated 100 times.

In Table 1, as expected, all methods maintain the size (fewer than 5% of the p-values are
less than 0.05). $L_1$-LDA is very conservative for all the 14 settings as it almost never rejects
the null. A possible explanation is that under the null hypothesis, that is, when the data
are generated from a single Gaussian distribution, a semi-supervised classification method
like S$^3$LDA makes the class difference even smaller than applying $L_1$-LDA on labeled data
only (since the former attempts to incorporate the useless information). As a consequence,
the CI from the original data (after applying S$^3$LDA) is often greater than the CIs from
the simulated data (after applying $L_1$-LDA), and hence the p-value is often very large.
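
To make the Case 1 setting concrete, here is a minimal sketch that draws one data set with n = 40, d = 300 and a diagonal covariance with w entries equal to v and the rest equal to 1, then computes the cluster index (within-class sum of squares over total sum of squares) for a given two-class assignment. The assignment used here, the sign of the first coordinate, is only a stand-in for illustration; it is not S$^3$LDA, $L_1$-LDA or COP-KMEANS, and all function names are hypothetical.

```python
import random

def case1_sample(n, d, v, w, rng):
    """Draw n observations from N(0, diag(v, ..., v, 1, ..., 1)) with w entries equal to v."""
    sds = [v ** 0.5] * w + [1.0] * (d - w)
    return [[rng.gauss(0.0, s) for s in sds] for _ in range(n)]

def cluster_index(data, labels):
    """Within-class sum of squares divided by the total sum of squares."""
    d = len(data[0])

    def sum_of_squares(rows):
        mean = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
        return sum((r[j] - mean[j]) ** 2 for r in rows for j in range(d))

    within = sum(
        sum_of_squares([r for r, lab in zip(data, labels) if lab == k])
        for k in sorted(set(labels))
    )
    return within / sum_of_squares(data)

rng = random.Random(1)
data = case1_sample(n=40, d=300, v=100, w=1, rng=rng)
# Illustrative assignment only: split on the sign of the first coordinate.
labels = [int(row[0] > 0) for row in data]
ci = cluster_index(data, labels)
print(f"cluster index for the illustrative split: {ci:.3f}")
```

A smaller cluster index means a cleaner split, which is why the test compares the index from the original data with the indices from data simulated under the null.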

5.1.2 Case 2: Mixture of Two Gaussian Distributions with Signal in One Coordinate Direction

We now consider data generated from a mixture of two Gaussian distributions,
$0.5N(-\mu, D) + 0.5N(\mu, D)$, where $\mu = (a, 0, \ldots, 0)'$ and $D = \mathrm{diag}(v, \ldots, v, 0, \ldots, 0)$,
with $w$ leading $v$'s, is a diagonal matrix.
The sample size is n = 40. From each class, we randomly choose 10 observations for which we
keep the class labels. Two covariance settings are considered here, v = 100, w = 1
and v = 2, w = 50. The choices of a depend on the values of v and w. Note that when a = 0,
the distribution reduces to a single Gaussian distribution. When a > 0, the population is
a mixture of two Gaussian distributions and the larger the a, the greater the separation
between the two distributions is. When the signal a is large enough, labeled data do not
help in distinguishing the two distributions (they may even make it worse). Thus one may
ignore the label information and simply apply SigClust. In our study, we focus on the cases
with small signals, in other words, when the mean difference of the two distributions is not
too large. In these cases, labeled data can greatly help to gain extra power in SigPal. The
empirical distributions of p-values are shown in Figures 4 and 5 for the two settings (v = 100,
w = 1 and v = 2, w = 50).

Figure 4 shows the setting v = 2 and w = 50 and Figure 5 shows the spiked model setting
v = 100 and w = 1. Colors are used to represent different values of a. When a = 0, the data
are generated from a single Gaussian distribution. When a > 0, we study the power of the
test using different methods. We consider a = 1, ..., 5 for the setting v = 2 and w = 50 and
a = 5, 10, 15, 18, 20 for the setting v = 100 and w = 1. We can see in Figure 4 that SigClust
is too conservative when a = 1, 2, 3 and there is almost no power when a = 1, 2. All the
three SigPal methods present more power in these settings.


[Figure 4 panels: v=2,w=50,L1-LDA; v=2,w=50,Cop-kmeans]

Figure 4: Empirical distributions of p-values of a mixture of two Gaussian distributions with
the signal in one direction. Results are based on different methods under the setting v = 2
and w = 50, with the increase of signal a.


For the spiked model setting in Figure 5 where v = 100 and w = 1, SigClust is
anti-conservative, indicated by the fact that the p-value has a higher chance of having a smaller
value (the grey curve is above the 45 degree line). On the other hand, S$^3$LDA and
COP-KMEANS are more powerful than SigClust. $L_1$-LDA loses some power as it only uses the
labeled data on the simulated data assignments. The comparison of the left two subfigures
($L_1$-LDA versus S$^3$LDA) also illustrates the effect of using a semi-supervised classification
method for label assignment compared to using a classification method: greater power is
retained as a consequence.


[Figure 5 panels: v=100,w=1,L1-LDA; v=100,w=1,Cop-kmeans; v=100,w=1,S3LDA]

Figure 5: Empirical distributions of p-values of a mixture of two Gaussian distributions with
the signal in one direction. Results are based on different methods under the setting v = 100
and w = 1, with the increase of signal a.

To make the simulation closer to reality, we also use the Human Lung Carcinomas
Microarray Dataset to obtain a more realistic covariance structure D. This data set was
previously analyzed in Bhattacharjee et al. (2001). Liu et al. (2008) used these data as a test
bed to demonstrate their proposed SigClust approach. We extract four biological groups in
the data set, 20 pulmonary carcinoid samples (Carcinoid), 13 colon cancer metastasis samples
(Colon), 17 normal lung samples (Normal) and 6 small cell carcinoma samples (SmallCell),
with a total of 56 samples. We remove the first three principal components of the data by
reconstructing the covariance matrix using the remaining terms of the eigen-expansion. The
resulting covariance D is used to generate the original data, and the estimated covariance is
used to generate the simulated data in the Simulation step of SigPal.
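
The covariance reconstruction described above can be sketched as follows. This is an illustration on synthetic data, assuming NumPy is available; the function name `drop_leading_components` and the toy scales are hypothetical, and the real analysis would start from the sample covariance of the microarray data.

```python
import numpy as np

def drop_leading_components(cov, k=3):
    """Rebuild a covariance matrix from all but its k largest eigen-terms."""
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    kept_vecs = eigvecs[:, k:]
    # Sum of lambda_j * u_j u_j' over the remaining terms of the eigen-expansion.
    return (kept_vecs * eigvals[k:]) @ kept_vecs.T

# Synthetic stand-in for the data: 56 samples, 10 variables, three dominant directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(56, 10)) * np.array([9, 7, 5, 1, 1, 1, 1, 1, 1, 1])
cov = np.cov(X, rowvar=False)
reduced = drop_leading_components(cov, k=3)
print(np.round(np.linalg.eigvalsh(reduced)[::-1], 3))
```

The reduced matrix keeps the eigenvectors of the original covariance, so it remains non-diagonal in the original coordinates, which is the feature the simulation is meant to retain.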

In terms of the signal, we consider the signal a in one direction taking the values 1, 2, ..., 7.
The empirical distributions of p-values are displayed in Figure 6.

Figure 6 shows that all of S$^3$LDA, COP-KMEANS and SigClust are strongly
anti-conservative under the null hypothesis (a = 0), which is not the case in either Figure 4
or Figure 5. The data in Figure 6 are generated from a non-diagonal covariance matrix D
while Figures 4 and 5 use a diagonal covariance D. Since CI is rotation invariant, this suggests
that these methods become anti-conservative due to the estimation of the covariance matrix.


[Figure 6 panels: L1-LDA; Cop-kmeans. Legend: a = 0, 1, 2, 3, 4, 5, 6, 7]

Figure 6: Empirical distributions of p-values of a mixture of two Gaussian distributions,
generated by the covariance matrix from the real data, with the signal in one direction.


[Figure 7 panels: CI under null for non-diagonal true D; CI under null for diagonal true D;
CI under null for non-diagonal est D; CI under null for diagonal est D.
Legend: L1_LDA, S3LDA, cop KM, KM]

Figure 7: Understanding why our method is anti-conservative for non-diagonal covariance
by comparing with the diagonal case.

To further understand the influence from the estimation of diagonal and non-diagonal
covariances, we compare the distributions of CI under the null hypothesis in Figure 7. The left
two plots show the distributions of CI for the data generated from a non-diagonal covariance
matrix (top) and for the data generated from the estimated covariance (bottom). It shows
that the density curves of CI for the simulated data (bottom) for S$^3$LDA, COP-KMEANS
and SigClust (red, blue and green curves) are all shifted to the right compared to the case
with the true distribution (top). The CI for the simulated data (bottom) is more likely to
be greater than the CI for the original data (top), which makes the p-value small more often
and hence the three methods become anti-conservative. This is consistent with the result
we see in Figure 6.

For the right two plots, we rotate and obtain a diagonal covariance matrix by
eigen-decomposition and plot the empirical distributions of CI for the data generated from the
true diagonal covariance (top) and from its estimation (bottom). The density curves of the
CI for the simulated data (bottom) for S$^3$LDA, COP-KMEANS and SigClust almost remain
in the same position as those using the true covariance; the curve for $L_1$-LDA shifts greatly
to the left. Based on the comparison between the left and the right panels, we confirm our
previous finding that S$^3$LDA, COP-KMEANS and SigClust become anti-conservative when
D is non-diagonal due to the influence from covariance estimation.


[Figure 8 panels: L1-LDA; Cop-kmeans; S3LDA; SigClust. Horizontal axis: p-value.
Legend: a = 0, 5, 10, 15, 20]

Figure 8: Empirical distributions of p-values of a mixture of two Gaussian distributions,
generated by a diagonal covariance matrix from the real data, with the signal in one direction.


After conducting eigen-decomposition of the covariance matrix used in Figure 6, we
obtain a diagonal D and use it to generate the data. New results are presented in Figure 8.
None of the methods is anti-conservative and all the three versions of SigPal are more
powerful than SigClust when the signal a is relatively small (a = 5, 10). We therefore recommend
rotating the data to form a diagonal covariance matrix before applying SigPal.
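
One way to carry out such a rotation in practice is to project the centered data onto the eigenvectors of the sample covariance, as in the minimal sketch below. This is an assumption about how the recommendation might be implemented, not a description of the code used for the figures; NumPy is assumed, and `rotate_to_principal_axes` is a hypothetical helper.

```python
import numpy as np

def rotate_to_principal_axes(X):
    """Center X and rotate it onto the eigenvectors of its sample covariance."""
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # largest variance first
    return centered @ eigvecs[:, order], eigvals[order]

# Toy data with correlated coordinates (sizes chosen only for illustration).
rng = np.random.default_rng(1)
mixing = rng.normal(size=(5, 5))
X = rng.normal(size=(200, 5)) @ mixing
rotated, variances = rotate_to_principal_axes(X)
rotated_cov = np.cov(rotated, rowvar=False)
print(np.round(rotated_cov, 6))
```

Because the transformation is orthogonal, pairwise distances between observations are unchanged, so the cluster index of any fixed assignment is the same before and after rotation; only the covariance estimation step sees a diagonal structure.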


[Figure 9 panels: v=2,w=50,L1-LDA; v=2,w=50,Cop-kmeans.
Legend: a = 0, 0.1, 0.15, 0.2, 0.25, 0.3]

Figure 9: Empirical distributions of p-values of a mixture of two Gaussian distributions with
the signal in all directions. Results are based on different methods under the setting v = 2
and w = 50, with the increase of signal a.


5.1.3 Case 3: Mixture of Two Gaussian Distributions With Signal in All Coordinate Directions

Now we further consider examples with signals in all coordinate directions. We generate
data from a mixture of two Gaussian distributions, $0.5N(-\mu, D) + 0.5N(\mu, D)$, where
$\mu = (a, \ldots, a)'$ and $D = \mathrm{diag}(v, \ldots, v, 0, \ldots, 0)$, with $w$ leading $v$'s, is a diagonal matrix.
We keep the class labels for 10 observations from each class and still consider two covariance
settings, v = 100, w = 1 and v = 2, w = 50. The signal a in each direction is deliberately
chosen to be very small; however, when all directions are combined, the total signal can be
very large. The empirical distributions of p-values calculated from 100 replications for the two
settings are displayed in Figures 9 and 10.

As in Figure 4, we see in Figure 9 that SigClust is too conservative when a = 0.1
and 0.15. All the three SigPal methods are more powerful than SigClust when a is
less than 0.25. For the spiked model in Figure 10 where v = 100 and w = 1, SigClust is
anti-conservative. SigPal is more powerful than SigClust when a < 0.6.

[Figure 10 panels: v=100,w=1,L1-LDA; v=100,w=1,Cop-kmeans. Horizontal axis: p-value.
Legend: a = 0, 0.2, 0.3, 0.4, 0.5, 0.6]

Figure 10: Empirical distributions of p-values of a mixture of two Gaussian distributions
with the signal in all directions. Results are based on different methods under the setting
v = 100 and w = 1, with the increase of signal a.

In summary, SigPal maintains the size under the null distribution while SigClust is
anti-conservative when the first eigenvalue is very large relative to the others. In all the cases
when the signal between the two distributions is small, SigPal is relatively more powerful
than SigClust due to the help from labeled data. Among the three versions of SigPal we
consider in the simulation study, COP-KMEANS performs the best in most cases. When
the data follow a distribution with a non-diagonal covariance matrix, the test could be
anti-conservative. Thus rotation of the data is recommended before SigPal is applied.

5.2 Real Data Application 

In this section, we apply our method to the breast cancer data (BRCA) from The Cancer
Genome Atlas Research Network, which has been studied by Fan et al. (2006) and Huang
et al. (2014). The data include four subtypes: LumA, LumB, Her2 and Basal. The sample
size is 348, among which there are 154 LumA, 81 LumB, 42 Her2 and 66 Basal. The number
of genes used in the analysis is 4000 after filtering. For every possible pairwise combination
of subclasses, we randomly select 20 observations from each class to keep the class labels
and the remaining observations are treated as unlabeled. We apply SigClust, SigPal and
DiProPerm to every possible pair of subclasses and report their p-values in Table 2. Here we
only conduct SigPal using COP-KMEANS for class assignment in this real example. Note
that these three methods use different information. SigClust does not require label
information while DiProPerm is applied to the two labeled classes. Our SigPal method is
designed for partially labeled data.
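
The partial-labeling step described above can be sketched in a few lines. The helper `keep_some_labels`, the use of `None` for unlabeled observations and the seed are illustrative assumptions; only the class sizes (154 LumA, 81 LumB) and the 20 retained labels per class come from the text.

```python
import random

def keep_some_labels(labels, per_class, rng):
    """Return a copy of labels in which all but per_class entries of each class are None."""
    kept = set()
    for cls in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == cls]
        kept.update(rng.sample(members, per_class))
    return [lab if i in kept else None for i, lab in enumerate(labels)]

rng = random.Random(0)
full_labels = ["LumA"] * 154 + ["LumB"] * 81
partial = keep_some_labels(full_labels, per_class=20, rng=rng)
n_labeled = sum(lab is not None for lab in partial)
print(n_labeled, "labeled,", partial.count(None), "unlabeled")  # 40 labeled, 195 unlabeled
```

For the LumA and LumB pair this leaves 40 labeled observations out of 235, which matches the count used later in the discussion of Figure 11.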


             Basal.LumA   Basal.LumB   Basal.Her2   LumA.LumB   Her2.LumB   Her2.LumA
SigClust         0            0          0.009        0.298       0.537       0.625
SigPal           0            0          0            0.002       0.002       0.013
DiProPerm        0            0          0            0           0           0

Table 2: SigClust, SigPal and DiProPerm p-values for each pair of subtypes for the BRCA
data. With the label information, DiProPerm can always reach significant results for all
pairs. With only partial information, SigPal can reach similar conclusions.


Table 2 shows that for pairs including Basal, the p-values from all three methods are 0
or very close to 0 (the largest is 0.009, from SigClust for Basal and Her2), which implies
that Basal can be well separated from the rest. For the remaining three pairs,
SigClust reports large p-values, which suggests that there is no strong evidence for them to
be viewed as from two different clusters if no label information is provided. In contrast, the
p-values of these three pairs for DiProPerm are all 0, indicating that each pair of two classes
can be significantly separated. With the help of a small portion of the label information,
SigPal gains much power to distinguish the two classes and produces results very close to
those of DiProPerm.

To further illustrate the three methods, we use the pair of LumA and LumB as an
example and display the scatter plot of the projections of the data vectors onto the first two
principal component directions in Figure 11. Colors indicate biological subtypes. LumA are
displayed in red and LumB in blue. Points in black are treated as unlabeled data. Figure
11 shows that without the class information (left plot), LumA and LumB seem to be
one subtype so that SigClust gives a non-significant result (p-value = 0.298). When all the
label information is available (right plot), DiProPerm suggests that these two classes are
significantly different (p-value = 0). With 40 labeled observations out of 235 observations in
total, our SigPal method finds the difference between the two classes by extracting useful
information from the small portion of the labeled data (middle plot).

[Figure 11 panels: PC1 & PC2 projection scatter plot for 'LumA' and 'LumB';
SigClust: pval=0.298; SigPal: pval=0.002; DiProPerm: pval=0]

Figure 11: PCA projection scatter plot of the two classes, LumA and LumB. Colors indicate
biological subtypes. LumA are displayed in red and LumB in blue. Points in black are
treated as unlabeled data. Data are analyzed by SigClust, SigPal and DiProPerm respectively
in the left, middle and right panels.

6 Conclusion 

In this article, we propose a significance analysis procedure, SigPal, in the HDLSS setting.
This method is designed for a data set where a small amount of labeled data are available with
a large amount of unlabeled data. In contrast to SigClust which does not rely on class label
information, our method makes use of the labeled data to increase the difference in the classes
under the null and alternative hypotheses. Through extensive simulation examples with
partial label information available, we compared the performance of SigPal with SigClust
in different settings. SigPal is relatively more powerful than SigClust, especially when the
signal between the two classes is not large. Among the three versions of SigPal we consider
in the simulation study, COP-KMEANS performs the best in most cases.

Although CI is rotation invariant, it turns out in the simulation study that under the null
hypothesis when the data come from a distribution with a non-diagonal covariance matrix,
SigPal could be anti-conservative. Hence, rotation of the data is recommended before SigPal
is applied.

SigPal is a general procedure with possibly many variants. The test statistic CI, used
in our numerical study, may be substituted by other quantities, such as Hotelling's $T^2$
statistic. There is also room for choosing different approaches to assign labels in the
Initialization step of SigPal. An interesting potential extension of SigPal is to the case of
testing multiple classes. A possible solution is to use a multi-class classification method for
the class assignment.


Appendix. Technical proofs 


Proof of Theorem 1 

Without loss of generality, assume that $\Sigma = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. We
first show that $\hat{\omega} = \omega_1 = (1, 0, \ldots, 0)'$, which is the direction of the greatest variation of the
data. Recall that $\hat{\omega}$ minimizes, over $\|\omega\| = 1$, the sum of the labeled-data loss term and
$C\,E_X(1 - |\omega'X|)_+$. As $C \to \infty$, we only need to show that $\omega_1 = (1, 0, \ldots, 0)'$ minimizes
the second term

$$\begin{aligned}
E[(1 - |Z|)_+] &= P(|Z| > 1)\,E(0 \mid |Z| > 1) + P(|Z| \le 1)\,E(1 - |Z| \mid |Z| \le 1) \\
&= 0 + P(|Z| \le 1)\,E(1 - |Z| \mid |Z| \le 1),
\end{aligned}$$

where $Z = \omega'X$.

Let $\omega_1 = (1, 0, \ldots, 0)'$ and $\omega_2 = (s_1, \ldots, s_d)'$ with $\sum_{j=1}^{d} s_j^2 = 1$. Then we have
$\mathrm{Var}(\omega_1'X) = \lambda_1 \ge \mathrm{Var}(\omega_2'X)$. Let $Z_1 = \omega_1'X$ and $Z_2 = \omega_2'X$, then $Z_1$ and
$Z_2$ follow Gaussian distributions with mean 0 and $\mathrm{Var}(Z_1) \ge \mathrm{Var}(Z_2)$. It follows that

1. $P(|Z_1| \le 1) \le P(|Z_2| \le 1)$;

2. $E(|Z_1| \mid |Z_1| \le 1) \ge E(|Z_2| \mid |Z_2| \le 1)$.

Therefore, $E[(1 - |Z_1|)_+] \le E[(1 - |Z_2|)_+]$, which implies that $\hat{\omega} = (1, 0, \ldots, 0)'$.
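
The monotonicity used in this step can be checked numerically. For $Z \sim N(0, s^2)$ one has $E[(1 - |Z|)_+] = P(|Z| \le 1) - E[|Z|\,1\{|Z| \le 1\}]$, and both pieces have closed forms in terms of the error function. The sketch below evaluates this expression for a few variances; the function name and the chosen variances are illustrative.

```python
import math

def expected_hinge(variance):
    """E[(1 - |Z|)_+] for Z ~ N(0, variance), in closed form."""
    s = math.sqrt(variance)
    # P(|Z| <= 1) = erf(1 / (s * sqrt(2)))
    prob_inside = math.erf(1.0 / (s * math.sqrt(2.0)))
    # E[|Z| 1{|Z| <= 1}] = 2 s / sqrt(2 pi) * (1 - exp(-1 / (2 s^2)))
    mean_abs_inside = 2.0 * s / math.sqrt(2.0 * math.pi) * (1.0 - math.exp(-1.0 / (2.0 * variance)))
    return prob_inside - mean_abs_inside

variances = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]
values = [expected_hinge(v) for v in variances]
for v, e in zip(variances, values):
    print(f"Var(Z) = {v:5.2f}   E[(1 - |Z|)_+] = {e:.4f}")
```

The expectation decreases as the variance grows, which is consistent with the conclusion that the direction of largest variance minimizes the second term.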

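To show the TCI, it is enough to consider the situation with diagonal covariance matrix
due to the rotation invariance of CI. We first compute the theoretical total sum of squares
TSS as

$$\begin{aligned}
\mathrm{TSS} &= E\|X\|^2 = \int \|x\|^2\phi(x)\,dx
= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \|x\|^2\phi(x)\,dx_1\cdots dx_d \\
&= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \Big(\sum_{j=1}^{d}x_j^2\Big)\Big(\prod_{j=1}^{d}\phi_{\lambda_j}(x_j)\Big)\,dx_1\cdots dx_d \\
&= \sum_{j=1}^{d}\int_{-\infty}^{\infty} x_j^2\,\phi_{\lambda_j}(x_j)\,dx_j = \sum_{j=1}^{d}\lambda_j,
\end{aligned}$$

where $\phi(x) = \prod_{j=1}^{d}\phi_{\lambda_j}(x_j) = \prod_{j=1}^{d}\frac{1}{\sqrt{2\pi\lambda_j}}\exp\{-x_j^2/(2\lambda_j)\}$.

Recall that we assume $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and we showed $\hat{\omega} = \omega_1 = (1, 0, \ldots, 0)'$. Here
$(1, 0, \ldots, 0)'$ is the normal vector of the separating hyperplane going through $\mu = (0, \ldots, 0)'$.

Let WSS be the theoretical within cluster sum of squares and let $\mathrm{WSS}_1$ and $\mathrm{WSS}_2$ denote
the theoretical sums of squares within class 1 and class 2 respectively. By symmetry the mean
of class 1 is $\mu_1 = (\mu_{11}, \mu_{12}, \ldots, \mu_{1d})'$ with $\mu_{12} = \mu_{13} = \cdots = \mu_{1d} = 0$. For the first dimension,
note that class 1 contains the original labeled data with mean 0 with probability $\theta$, and the
original unlabeled data assigned to class 1 with mean $2\int_0^{\infty} x_1\phi_{\lambda_1}(x_1)\,dx_1$ with probability
$(1 - \theta)$, where $\theta$ is the proportion of the labeled data. Thus we have

$$\mu_{11} = (1 - \theta)\cdot 2\int_0^{\infty} x_1\phi_{\lambda_1}(x_1)\,dx_1 + \theta\cdot 0 = (1 - \theta)\sqrt{\frac{2\lambda_1}{\pi}}.$$

So $\mu_1 = \big((1 - \theta)\sqrt{2\lambda_1/\pi},\, 0, \ldots, 0\big)'$. Similarly, $\mu_2 = \big(-(1 - \theta)\sqrt{2\lambda_1/\pi},\, 0, \ldots, 0\big)'$. Then

$$\begin{aligned}
\mathrm{WSS}_1 &= (1 - \theta)\int_0^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\|x - \mu_1\|^2\phi(x)\,dx_1\cdots dx_d
+ \theta\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\|x - \mu_1\|^2\phi(x)\,dx_1\cdots dx_d \\
&= (1 - \theta)\left[\int_0^{\infty}\Big(x_1 - (1 - \theta)\sqrt{\tfrac{2\lambda_1}{\pi}}\Big)^2\phi_{\lambda_1}(x_1)\,dx_1
+ \sum_{j=2}^{d}\int_0^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}x_j^2\,\phi(x)\,dx_1\cdots dx_d\right] \\
&\quad + \theta\left[\int_{-\infty}^{\infty}\Big(x_1 - (1 - \theta)\sqrt{\tfrac{2\lambda_1}{\pi}}\Big)^2\phi_{\lambda_1}(x_1)\,dx_1
+ \sum_{j=2}^{d}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}x_j^2\,\phi(x)\,dx_1\cdots dx_d\right] \\
&= \left[\frac{1}{2}(\theta + 1) + \frac{1}{\pi}(\theta^3 - 3\theta^2 + 3\theta - 1)\right]\lambda_1 + \frac{\theta + 1}{2}\sum_{j=2}^{d}\lambda_j.
\end{aligned}$$

Similarly, $\mathrm{WSS}_2 = \mathrm{WSS}_1$. Thus,

$$\mathrm{TCI} = \frac{\mathrm{WSS}_1 + \mathrm{WSS}_2}{\mathrm{TSS}}
= 1 + \theta + \frac{2}{\pi}(\theta^3 - 3\theta^2 + 3\theta - 1)\,\frac{\lambda_1}{\sum_{j=1}^{d}\lambda_j}.$$

□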
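
As a sanity check on the algebra, the first-coordinate part of $\mathrm{WSS}_1$ can be integrated numerically and compared with a closed form. The sketch below assumes the decomposition read off from this proof: weight $(1 - \theta)$ on the half line $x_1 > 0$, weight $\theta$ on the whole line, class mean $m = (1 - \theta)\sqrt{2\lambda_1/\pi}$, and closed form $[\frac{1}{2}(\theta + 1) + \frac{1}{\pi}(\theta^3 - 3\theta^2 + 3\theta - 1)]\lambda_1$. The midpoint rule, the truncation at 12 standard deviations and all names are illustrative choices.

```python
import math

def phi(x, lam):
    """Density of N(0, lam) at x."""
    return math.exp(-x * x / (2 * lam)) / math.sqrt(2 * math.pi * lam)

def integrate(f, lo, hi, steps=20000):
    """Midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h

def wss1_first_coordinate(theta, lam):
    """Numerical value of the first-coordinate contribution to WSS_1."""
    m = (1 - theta) * math.sqrt(2 * lam / math.pi)   # first coordinate of the class 1 mean
    g = lambda x: (x - m) ** 2 * phi(x, lam)
    bound = 12 * math.sqrt(lam)                      # tails beyond this are negligible
    return (1 - theta) * integrate(g, 0.0, bound) + theta * integrate(g, -bound, bound)

def closed_form(theta, lam):
    return ((theta + 1) / 2 + (theta**3 - 3 * theta**2 + 3 * theta - 1) / math.pi) * lam

for theta in (0.0, 0.25, 0.5, 1.0):
    num, exact = wss1_first_coordinate(theta, 2.0), closed_form(theta, 2.0)
    print(f"theta = {theta:.2f}  numerical = {num:.6f}  closed form = {exact:.6f}")
```

At $\theta = 0$ the coefficient of $\lambda_1$ is $1/2 - 1/\pi$, so that $\mathrm{WSS}_1 + \mathrm{WSS}_2$ reproduces the SigClust value $1 - \frac{2}{\pi}\lambda_1/\sum_j\lambda_j$ after dividing by TSS.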


Proof of Theorem 2 


The proof is similar to the proof of Theorem 1 in Liu et al. (2008). It is sufficient to show
the following two points:

1. The CI $\xi_1$ of the data from the mixture of two Gaussian distributions, using the sources
of each observation from the two Gaussian distributions as cluster assignments, converges to
0 in probability as $d \to \infty$.

2. The CI under the null hypothesis is bounded away from 0 as $d \to \infty$.

Point 1 can be shown by introducing a new data set, which is easier to work with, and a
modified corresponding CI, $\xi_2$. In particular, consider an iid sample $y_1, \ldots, y_n$ from $N(0, D)$.
Note that $x_i = y_i + \delta_i\mu$, where $\delta_i = 0$ if $x_i$ comes from $N(0, D)$ and 1 if $x_i$ comes from
$N(\mu, D)$. $x_i = y_i + \delta_i\mu$ implies that $\xi_1 = \xi_2$. Let $C^{(1)}$ and $C^{(2)}$ denote the sample index sets
of $x_i$ with $\delta_i = 0$ and $\delta_i = 1$ respectively. By definition, we have

$$\xi_1 = \frac{\sum_{k=1}^{2}\sum_{i \in C^{(k)}}\|x_i - \bar{x}^{(k)}\|^2}{\sum_{i=1}^{n}\|x_i - \bar{x}\|^2}.$$

$\xi_2$ can be written using $y_1, \ldots, y_n$, $a_1, \ldots, a_d$ and $\delta$ as

$$\begin{aligned}
\xi_2 &= \frac{\sum_{k=1}^{2}\sum_{i \in C_k}\|y_i - \bar{y}^{(k)}\|^2}
{\sum_{i=1}^{n}\|y_i - \bar{y}\|^2 + \frac{n_1 n_2}{n}\sum_{j=1}^{d}a_j^2 + \frac{2 n_1 n_2}{n}\sum_{j=1}^{d}a_j\big(\bar{y}_j^{(2)} - \bar{y}_j^{(1)}\big)} \\
&\le \frac{\frac{1}{d}\sum_{k=1}^{2}\sum_{i \in C_k}\|y_i - \bar{y}^{(k)}\|^2}
{\frac{n_1 n_2}{nd}\sum_{j=1}^{d}a_j^2 + \frac{2 n_1 n_2}{nd}\sum_{j=1}^{d}a_j\big(\bar{y}_j^{(2)} - \bar{y}_j^{(1)}\big)},
\end{aligned} \tag{4}$$

where $C_1$ and $C_2$ denote the random grouping indices of the sample into two classes of size
$n_1$ and $n_2$, and $\bar{y}_j$ and $\bar{y}_j^{(k)}$ denote the overall sample mean and the sample mean of class $k$ of the
$j$th variable. Note that $\sum_{i \in C_k}(y_{ij} - \bar{y}_j^{(k)})^2 \sim \lambda_j\chi^2(n_k - 1)$, $\sum_{j=1}^{d}\lambda_j = O(d^{\beta})$ and $\beta < 1$. Thus
both the mean and variance of the numerator in (4) converge to 0 as $d \to \infty$, which implies
that the numerator of (4) converges to 0 in probability.

For the denominator of (4), since $\sum_{j=1}^{d}a_j^2 = O(d)$, the first term converges to a constant
as $d \to \infty$. For the second term, note that
$\sum_{j=1}^{d}a_j\big(\bar{y}_j^{(2)} - \bar{y}_j^{(1)}\big) \sim N\big(0, \sum_{j=1}^{d}\lambda_j a_j^2\,(\tfrac{1}{n_1} + \tfrac{1}{n_2})\big)$.
Because $\sum_{j=1}^{d}\lambda_j a_j^2 = O(d^{\gamma})$ with $\gamma < 2$, the second term of the denominator of (4) converges
to 0 in probability as $d \to \infty$. Therefore, $\xi_2 \to 0$ in probability as $d \to \infty$, which implies
$\xi_1 \to 0$ in probability as $d \to \infty$.

To show point 2, we first get the Gaussian null distribution of the mixture as $N(0, D^*)$,
where $D^*$ is diagonal with the $j$th diagonal element $\lambda_j + \eta(1 - \eta)a_j^2$. Here we simply assume
the null distribution with mean 0 since CI is location invariant. Let $z_1, \ldots, z_n$ be a sample
from the Gaussian null distribution. We want to show that the corresponding CI is bounded
away from 0 as $d \to \infty$. To this end, we make use of the HDLSS geometry of Hall et al.
(2005). We first check the three assumptions:

(1) the fourth moments of all entries of $z$ are uniformly bounded, by the assumption that
$\max_j\{\lambda_j + \eta(1 - \eta)a_j^2\} \le M$;

(2) $\lim_{d \to \infty}\mathrm{trace}(D^*)/d = \sigma^2$, where $\sigma^2$ is a constant. This is because
$\lim_{d \to \infty}\mathrm{trace}(D^*)/d = \lim_{d \to \infty}\big(\sum_{j=1}^{d}\lambda_j + \eta(1 - \eta)\sum_{j=1}^{d}a_j^2\big)/d
= \eta(1 - \eta)\lim_{d \to \infty}\sum_{j=1}^{d}a_j^2/d = \sigma^2$;

(3) the random vector satisfies the $\rho$-mixing condition by the independence among the
entries of $z$.

Then it follows that $\|z_i - z_l\|^2 = 2d\sigma^2 + O_p(1)$, as $d \to \infty$. By the triangle inequality,
we have $2(\|z_i - \bar{z}\|^2 + \|z_l - \bar{z}\|^2) \ge (\|z_i - \bar{z}\| + \|z_l - \bar{z}\|)^2 \ge \|z_i - z_l\|^2$ and
$\|z_i - \bar{z}\| = \frac{1}{n}\big\|(n - 1)z_i - \sum_{l \ne i}z_l\big\| \le \frac{1}{n}\sum_{l \ne i}\|z_i - z_l\|$. Thus, as $d \to \infty$, we can bound
the CI under the null hypothesis as

$$\mathrm{CI} \ge \frac{\frac{1}{2}\big([n_1/2] + [n_2/2]\big)\big(2d\sigma^2 + O_p(1)\big)}{\frac{(n - 1)^2}{n}\big(2d\sigma^2 + O_p(1)\big)}
\to \frac{n\big([n_1/2] + [n_2/2]\big)}{2(n - 1)^2},$$

where $[u]$ denotes the largest integer not greater than $u$. This shows that under the null
hypothesis CI is bounded away from 0 as $d \to \infty$. □
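
The geometric fact behind point 2 can be seen in a small simulation. The sketch below draws n = 10 standard Gaussian vectors in d = 2000 dimensions (so $\sigma^2 = 1$), checks that every scaled squared pairwise distance $\|z_i - z_l\|^2/(2d\sigma^2)$ is close to 1, and compares the CI of an arbitrary 5 and 5 split with the limiting bound $n([n_1/2] + [n_2/2])/(2(n - 1)^2)$ as read from the argument above. Sample sizes, dimension, seed and function names are illustrative assumptions.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cluster_index(data, labels):
    """Within-class sum of squares divided by the total sum of squares."""
    d = len(data[0])

    def sum_of_squares(rows):
        mean = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
        return sum(squared_distance(r, mean) for r in rows)

    within = sum(
        sum_of_squares([r for r, lab in zip(data, labels) if lab == k])
        for k in sorted(set(labels))
    )
    return within / sum_of_squares(data)

rng = random.Random(2)
n, d, sigma2 = 10, 2000, 1.0
sample = [[rng.gauss(0.0, sigma2 ** 0.5) for _ in range(d)] for _ in range(n)]

# Scaled squared pairwise distances; HDLSS geometry suggests values near 1.
ratios = [squared_distance(sample[i], sample[j]) / (2 * d * sigma2)
          for i in range(n) for j in range(i + 1, n)]
labels = [0] * 5 + [1] * 5                      # an arbitrary split under the null
ci = cluster_index(sample, labels)
lower_bound = n * (5 // 2 + 5 // 2) / (2 * (n - 1) ** 2)

print(f"scaled distances range from {min(ratios):.3f} to {max(ratios):.3f}")
print(f"CI = {ci:.3f}, limiting lower bound = {lower_bound:.3f}")
```

The bound is far from tight in this example, but it is enough for the proof, which only needs the null CI to stay away from 0.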

References 

Ahn, J., Marron, J., Muller, K., and Chi, Y. (2007), “The high-dimension, low-sample-size 
geometric representation holds under mild conditions,” Biometrika, 94, 760. 

Bai, Z. and Saranadasa, H. (1996), “Effect of high dimension: by an example of a two sample 
problem,” Statistica Sinica, 6, 311-330. 

Bhattacharjee, A., Richards, W. G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C.,
Beheshti, J., Bueno, R., Gillette, M., et al. (2001), “Classification of human lung carcinomas
by mRNA expression profiling reveals distinct adenocarcinoma subclasses,” Proceedings of
the National Academy of Sciences, 98, 13790-13795.

Chandriani, S., Frengen, E., Cowling, V. H., Pendergrass, S. A., Perou, C. M., Whitfield,
M. L., and Cole, M. D. (2009), “A core MYC gene expression signature is prominent in
basal-like breast cancer but only partially overlaps the core serum response,” PLoS One,
4, e6693.

Chapelle, O., Schölkopf, B., Zien, A., et al. (2006), Semi-supervised learning, MIT Press,
Cambridge.

Chen, S. and Qin, Y. (2010), “A two-sample test for high-dimensional data with applications
to gene-set testing,” The Annals of Statistics, 38, 808-835.

Collins, M. and Singer, Y. (1999), “Unsupervised models for named entity classification,” in
Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language
Processing and Very Large Corpora, pp. 189-196.

Cortes, C. and Vapnik, V. (1995), “Support-vector networks,” Machine learning, 20, 273- 
297. 


Dempster, A. P. (1960), “A significance test for the separation of two highly multivariate
small samples,” Biometrics, 16, 41-50.

Fan, C., Oh, D. S., Wessels, L., Weigelt, B., Nuyten, D. S., Nobel, A. B., van’t Veer, L. J.,
and Perou, C. M. (2006), “Concordance among gene-expression-based predictors for breast
cancer,” New England Journal of Medicine, 355, 560-569.

Hall, P., Marron, J. S., and Neeman, A. (2005), “Geometric representation of high dimension,
low sample size data,” Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 67, 427-444.

Huang, H., Liu, Y., Yuan, M., and Marron, J. (2014), “Statistical significance of clustering
using soft thresholding,” Journal of Computational and Graphical Statistics, 00-00.

Jung, S. and Marron, J. (2009), “PCA consistency in high dimension, low sample size
context,” The Annals of Statistics, 37, 4104-4130.

Jung, S., Sen, A., and Marron, J. (2012), “Boundary behavior in high dimension, low sample 
size asymptotics of PCA,” Journal of Multivariate Analysis, 109, 190-203. 

Land, W. H., Ma, X., Barnes, E., Qiao, X., Heine, J., Masters, T., and Park, J. W. (2012), 
“PNN/GRNN ensemble processor design for early screening of breast cancer,” Procedia 
Computer Science, 12, 438-443. 

Liu, Y., Hayes, D. N., Nobel, A., and Marron, J. (2008), “Statistical significance of clustering 
for high-dimension, low-sample size data,” Journal of the American Statistical Association, 
103. 

Lu, Q. and Qiao, X. (2015), “Sparse Fisher’s Linear Discriminant Analysis for Partially 
Labeled Data,” arXiv, 1509.05438. 

Marron, J., Todd, M., and Ahn, J. (2007), “Distance-weighted discrimination,” Journal of 
the American Statistical Association, 102, 1267-1271. 


McLachlan, G. and Peel, D. (2004), Finite mixture models, John Wiley & Sons. 

McShane, L. M., Radmacher, M. D., Freidlin, B., Yu, R., Li, M.-C., and Simon, R. (2002), 
“Methods for assessing reproducibility of clustering patterns observed in analyses of mi¬ 
croarray data,” Bioinformatics, 18, 1462-1469. 

Qiao, X., Zhang, H., Liu, Y., Todd, M., and Marron, J. (2010), “Weighted distance weighted 
discrimination and its asymptotic properties,” Journal of the American Statistical Asso¬ 
ciation, 105, 401-414. 

Qiao, X. and Zhang, L. (2015a), “Distance-weighted Support Vector Machine,” Statistics 
and Its Interface, 8, 331-345. 

— (2015b), “Flexible High-dimensional Classification Machines and Their Asymptotic Properties,”
Journal of Machine Learning Research, forthcoming.

Sarle, W. and Kuo, A.-H. (1993), “The MODECLUS procedure,” SAS Technical Report P-256,
SAS Institute, Cary, North Carolina.

Schaffer, J. D., Park, J. W., Barnes, E., Lu, Q., Qiao, X., Deng, Y., Li, Y., and Land Jr, 
W. H. (2012), “GRNN ensemble classifier for lung cancer prognosis using only demographic
and TNM features,” Procedia Computer Science, 12, 450-455. 

Schott, J. (2007), “Some high-dimensional tests for a one-way MANOVA,” Journal of Multivariate
Analysis, 98, 1825-1839.

Srivastava, M. (2007), “Multivariate theory for analyzing high dimensional data,” Journal 
of the Japan Statistical Society, 37, 53-86. 

Srivastava, M. S. and Du, M. (2008), “A test for the mean vector with fewer observations 
than the dimension,” Journal of Multivariate Analysis, 99, 386-402. 

Srivastava, M. S. and Fujikoshi, Y. (2006), “Multivariate analysis of variance with fewer 
observations than the dimension,” Journal of Multivariate Analysis, 97, 1927-1940. 


Suzuki, R. and Shimodaira, H. (2006), “Pvclust: an R package for assessing the uncertainty 
in hierarchical clustering,” Bioinformatics, 22, 1540-1542. 

Tibshirani, R. and Walther, G. (2005), “Cluster validation by prediction strength,” Journal 
of Computational and Graphical Statistics, 14, 511-528. 

Vapnik, V. (1995), The Nature of Statistical Learning Theory, Springer. 

— (1998), Statistical Learning Theory, Wiley. 

Verhaak, R. G., Hoadley, K. A., Purdom, E., Wang, V., Qi, Y., Wilkerson, M. D., Miller, 
C. R., Ding, L., Golub, T., Mesirov, J. P., et al. (2010), “Integrated genomic analysis
identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in
PDGFRA, IDH1, EGFR, and NF1,” Cancer Cell, 17, 98-110.

Wagstaff, K., Cardie, C., Rogers, S., Schrodl, S., et al. (2001), “Constrained k-means clustering
with background knowledge,” in ICML, vol. 1, pp. 577-584.

Wang, J. and Shen, X. (2007), “Large margin semi-supervised learning,” Journal of Machine 
Learning Research, 8, 1867-1891. 

Wang, J., Shen, X., and Pan, W. (2007), “On transductive support vector machines,” Contemporary
Mathematics, 443, 7-20.

— (2009), “On efficient large margin semisupervised learning: Method and theory,” Journal 
of Machine Learning Research, 10, 719-742. 

Wei, S., Lee, C., Wichers, L., Li, G., and Marron, J. (2015), “Direction-Projection-Permutation
for High Dimensional Hypothesis Tests,” Journal of Computational and
Graphical Statistics, forthcoming.

Wichers, L., Lee, C., Costa, D., Watkinson, P., and Marron, J. (2007), “A functional data 
analysis approach for evaluating temporal physiologic responses to particulate matter,” 
Tech. Rep. 5, University of North Carolina at Chapel Hill, Department of
Statistics and Operations Research. 





